Enhancing Chinese Word Segmentation Using Unlabeled Data

Authors

  • Weiwei Sun
  • Jia Xu
Abstract

This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution for incorporating features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve system recall for out-of-vocabulary (OOV) words that appear more than once inside a document. The novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and OOV recall, respectively.
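To make the document-level idea concrete, here is a minimal sketch of how repeated character strings inside a single document could be turned into per-character features for a character-based segmenter. The abstract does not specify the paper's actual feature templates, so everything here is an assumption for illustration: the function names `repeated_ngrams` and `char_features`, the n-gram length range, and the repetition threshold are all hypothetical.

```python
from collections import Counter


def repeated_ngrams(document, max_n=4, min_count=2):
    """Return character n-grams (length 2..max_n) seen at least min_count times.

    Assumption: a string that recurs inside one document is a candidate word,
    which is the intuition behind document-level features for repeated OOV words.
    """
    counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(document) - n + 1):
            counts[document[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def char_features(document, index, repeated, max_n=4):
    """Binary features for the character at `index` (hypothetical templates).

    For each length n, record whether a repeated n-gram starts or ends
    at this character position.
    """
    feats = {}
    for n in range(2, max_n + 1):
        starts = document[index:index + n]
        ends = document[max(0, index - n + 1):index + 1]
        feats[f"starts_repeated_{n}"] = len(starts) == n and starts in repeated
        feats[f"ends_repeated_{n}"] = len(ends) == n and ends in repeated
    return feats
```

In a setup like this, `repeated_ngrams` would be run once per test document, and the resulting features would be added alongside the usual character-context features of the discriminative model. The same counting routine, run over a large corpus instead of one document, would give corpus-level string statistics of the kind the abstract mentions.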

Similar resources

Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

This paper presents a semi-supervised Chinese word segmentation (CWS) approach that co-regularizes character-based and word-based models. Similarly to multi-view learning, the “segmentation agreements” between the two different types of view are used to overcome the scarcity of the label information on unlabeled data. The proposed approach trains a character-based and word-based model on labele...

Exploiting Unlabeled Text with Different Unsupervised Segmentation Criteria for Chinese Word Segmentation

This paper presents a novel approach to improve Chinese word segmentation (CWS) that attempts to utilize unlabeled data such as training and test data without annotation for further enhancement of the state-of-the-art performance of supervised learning. The lexical information plays the role of information transformation from unlabeled text to supervised learning model. Four types of unsupervis...

Exploiting unlabeled internal data in conditional random fields to reduce word segmentation errors for Chinese texts

The application of text-to-speech (TTS) conversion has become widely used in recent years. Chinese TTS faces several unique difficulties. The most critical is caused by the lack of word delimiters in written Chinese. This means that Chinese word segmentation (CWS) must be the first step in Chinese TTS. Unfortunately, due to the ambiguous nature of word boundaries in Chinese, even the best CWS s...

Enhancing LSTM-based Word Segmentation Using Unlabeled Data

The word segmentation problem is widely solved as a sequence labeling problem. The traditional approach to this kind of problem is a machine learning method such as conditional random fields with hand-crafted features. Recently, deep learning approaches have achieved state-of-the-art performance on the word segmentation task, and a popular method among them is LSTM networks. This paper gives a method to introduce nu...

Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

Micro-blog is a new kind of medium which is short and informal. While no segmented corpus of micro-blogs is available to train Chinese word segmentation model, existing Chinese word segmentation tools cannot perform equally well as in ordinary news texts. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog. In our approach, we incorporate punctua...

Publication date: 2011